Generalized Linear Models III

Chelsea Parlett-Pelleriti

Time Series Data

Time Series data are repeated measurements collected over time from the same system/subject

Repeated Measures with MEMs

When the number of repeated measurements is small (e.g. pre/post/follow up for a psychology intervention) we can use Mixed Effect Models!

Repeated Measures with MEMs

# mem with interaction of treatment and time
repeated_mlm <- lmer(Outcome ~ Treatment * Time +
                       (1 + Time | Participant),
                     data = rt_data)

summary(repeated_mlm)
Linear mixed model fit by REML ['lmerMod']
Formula: Outcome ~ Treatment * Time + (1 + Time | Participant)
   Data: rt_data

REML criterion at convergence: 13724.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.9099 -0.5381  0.0003  0.5243  3.3140 

Random effects:
 Groups      Name        Variance Std.Dev. Corr
 Participant (Intercept) 57.44040 7.5789       
             Time         0.05631 0.2373   0.06
 Residual                 0.96553 0.9826       
Number of obs: 3000, groups:  Participant, 1000

Fixed effects:
                        Estimate Std. Error t value
(Intercept)             49.05630    0.34656 141.550
TreatmentTreatment       0.22623    0.48865   0.463
Time                     0.04587    0.03293   1.393
TreatmentTreatment:Time  0.33615    0.04644   7.239

Correlation of Fixed Effects:
            (Intr) TrtmnT Time  
TrtmntTrtmn -0.709              
Time        -0.152  0.108       
TrtmntTrt:T  0.108 -0.152 -0.709

Repeated Measures with MEMs

Time Series

But often, we have many more observations. We can break these time series down into 3 components:

  • seasonal: variation in value due to repeated patterns over time (e.g. day of week, day of year, day of month, time of day)

  • trend: changes over time in the value

  • noise: variation in the value not due to seasonality or trend

Time Series Decomposition

\[ Y_t = Tr_t + S_t + \epsilon_t \]
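As a quick sketch of this additive decomposition, base R's `decompose()` will estimate the three components; the series below is simulated, with trend, seasonal, and noise values made up for illustration:

```r
# simulated monthly series: linear trend + yearly seasonality + noise
set.seed(1)
t <- 1:120
y <- ts(0.05 * t + 2 * sin(2 * pi * t / 12) + rnorm(120, sd = 0.5),
        frequency = 12)

# additive decomposition: Y_t = Tr_t + S_t + e_t
parts <- decompose(y, type = "additive")
head(parts$trend, 12)     # moving-average estimate of Tr_t (NA at the edges)
head(parts$seasonal, 12)  # repeating estimate of S_t
```

`stl()` is a more flexible alternative when the seasonal pattern itself changes over time.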

Autocorrelation and Partial Autocorrelation

💡 with MCMC samples, we noted autocorrelation in our chains because each sample in the chain was conditioned on the previous sample

Autocorrelation

Autocorrelation

Autocorrelation

Partial Autocorrelation

\[ y_t = \phi_{21} y_{t-1} + \phi_{22} y_{t-2} + \epsilon_t \]

Partial Autocorrelation

\[ \text{pacf} = \frac{cov(y_t,y_{t-2} | y_{t-1})}{\sqrt{var(y_t| y_{t-1})* var(y_{t-2}| y_{t-1})}} \]

Coefficient Interpretation in GLMs

correlation: the correlation between \(x_1\) and \(y\)

Coefficient Interpretation in GLMs

semi-partial correlation: the correlation between \(x_1\) and \(y\), after subtracting the influence of \(x_2\) on \(x_1\)

Coefficient Interpretation in GLMs

partial correlation: the correlation between \(x_1\) and \(y\), after subtracting the influence of \(x_2\) on \(x_1\) and subtracting the influence of \(x_2\) on \(y\)

Partial Autocorrelation

partial autocorrelation: the correlation between \(y_t\) and \(y_{t-2}\), after subtracting the influence of \(y_{t-1}\) on \(y_t\) and subtracting the influence of \(y_{t-1}\) on \(y_{t-2}\)
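This definition can be checked directly: regress \(y_{t-1}\) out of both \(y_t\) and \(y_{t-2}\), then correlate the residuals. A sketch on simulated AR(2) data (the coefficients are chosen arbitrarily):

```r
# lag-2 partial autocorrelation "by hand" vs. R's pacf()
set.seed(2)
y <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 5000)

y_t <- y[3:5000]; y_1 <- y[2:4999]; y_2 <- y[1:4998]

# subtract the influence of y_{t-1} from y_t and from y_{t-2}
r_t <- resid(lm(y_t ~ y_1))
r_2 <- resid(lm(y_2 ~ y_1))

cor(r_t, r_2)                 # by-hand partial autocorrelation at lag 2
pacf(y, plot = FALSE)$acf[2]  # R's estimate (close, not identical)
```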

Partial Autocorrelation

Stationarity of a TS

  1. \(\mu\) is constant

  2. \(\sigma\) is constant

  3. no seasonality (constant autocorrelation)

Stationarity

Stationarity

Stationarity

Terminology: White Noise

  1. \(\mu\) is 0

  2. \(\sigma\) is constant

  3. no autocorrelation

Are Non-Stationary Time Series Useless?

This time series is not stationary, is there anything we can do?

\[ x_t = \beta_0 + \beta_1*t + \epsilon_t \]

Are Non-Stationary Time Series Useless?

💡 what if we used \(x_t\) to create a new time series, \(z_t\)?

\[ z_t = x_t - x_{t-1} = \\ \left[\beta_0 + \beta_1*t + \epsilon_t\right] - \left[\beta_0 + \beta_1*(t-1) + \epsilon_{t-1}\right] = \\ \beta_1 + (\epsilon_t - \epsilon_{t-1}) \]

❓ what is the mean and variance of \(z_t\)?

Are Non-Stationary Time Series Useless?

\[ \mathbb{E}(z_t) = \mathbb{E}(\beta_1) + \mathbb{E}(\epsilon_t) - \mathbb{E}(\epsilon_{t-1}) = \beta_1 \]

\[ Var(z_t) = Var(\epsilon_t - \epsilon_{t-1}) = 2\sigma^2 \]
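A numeric sanity check of these two results, differencing a simulated trend series with `diff()` (the \(\beta\) and \(\sigma\) values are arbitrary):

```r
# z_t = x_t - x_{t-1} should have mean beta_1 and variance 2*sigma^2
set.seed(3)
n <- 1e5; beta0 <- 2; beta1 <- 0.5; sigma <- 1
x <- beta0 + beta1 * (1:n) + rnorm(n, sd = sigma)
z <- diff(x)

mean(z)  # ~ beta1 = 0.5
var(z)   # ~ 2 * sigma^2 = 2
```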

Notation: Lag/Backshift Operator

\(B\), the backshift operator, represents a lagged version of a variable. For example:

\[ By_t = y_{t-1} \\ B^2y_t = y_{t-2} \]

AR Model

\[ y = \beta_0 + \beta_1*x_1 + ...+ \beta_n * x_n + \epsilon \]

Review: in GLMs, we use various predictors (like age, height, income) to predict an outcome (like diabetes)

\[ y_t = \beta_0 + \phi_1* y_{t-1} + \phi_2* y_{t-2} + ...+ \phi_n* y_{t-n} + \epsilon_t \]

In Time Series models we’re using lagged observations as predictors for our current observation.

AR Model

An \(AR(p)\) model is an autoregressive model of a stationary time series with the form:

\[ y_t = \beta_0 + \underbrace{\sum_{i=1}^p \phi_iy_{t-i}}_{AR(p)} + \epsilon_t \]

where \(\beta_0 = (1-\sum_{i=1}^{p}\phi_i)\mu\), and \(\epsilon_t\) is a white-noise process.
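`arima.sim()` simulates such a process; one diagnostic property worth remembering is that the PACF of an AR(p) cuts off after lag \(p\). A sketch with illustrative coefficients:

```r
# simulate an AR(2); its pacf should spike at lags 1-2, then drop to ~0
set.seed(4)
sim <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 2000)
p <- pacf(sim, plot = FALSE)
round(p$acf[1:4], 2)  # lags 1 and 2 clearly nonzero, lags 3+ near zero
```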

AR Model

\[ y_t = \phi_1y_{t-1} + \epsilon_t \]

❓ if \(\phi_1\) is 2, sketch out what the time series of \(y\) would look like

❓ if \(\phi_1\) is -10, sketch out what the time series of \(y\) would look like

❓ if \(\phi_1\) is 0.5, sketch out what the time series of \(y\) would look like

AR Model

Assume the time series of \(y_t\) is stationary.

\[ \mathbb{E}(y_t) = \mathbb{E}(\beta_0) + \mathbb{E}(\phi_1y_{t-1}) + \mathbb{E}(\epsilon_t) \]

\(\epsilon_t\) is a white noise process, so its expected value is \(0\).

\[ \mathbb{E}(y_t) = \beta_0 + \phi_1\mathbb{E}(y_{t-1}) \]

Since \(y_t\) and \(y_{t-1}\) are from the same stationary time series, \(\mathbb{E}(y_t) =\mathbb{E}(y_{t-1}) = \mu\).

AR Model

\[ \mu = \beta_0 + \phi_1\mu \\ \mu - \beta_0 = \phi_1\mu \\ \mu - \phi_1\mu = \beta_0 \\ \mu(1 - \phi_1) = \beta_0 \\ \mu = \frac{\beta_0}{(1 - \phi_1)} \\ \mathbb{E}(y_t) = \frac{\beta_0}{(1 - \phi_1)} \]

\(\mathbb{E}(y_t)\) is only defined if \(\phi_1 \neq 1\)

AR Model

\[ Var(\beta_0 + \phi_1y_{t-1} + \epsilon_t) = \\ \frac{\sigma^2}{1 - \phi_1^2} \]

(I’ll spare you the whole derivation) where \(\sigma^2\) is the variance of \(\epsilon_t\).

The variance \(\frac{\sigma^2}{1 - \phi_1^2}\) has to be positive, so \((1-\phi_1^2) > 0\) which is only true when \(|\phi_1| < 1\).
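Both moments can be verified by simulating an AR(1) directly (a sketch, with arbitrary values \(\beta_0 = 1\), \(\phi_1 = 0.5\), \(\sigma = 1\)):

```r
# simulate y_t = beta0 + phi*y_{t-1} + e_t and check the stationary moments
set.seed(5)
n <- 1e5; beta0 <- 1; phi <- 0.5; sigma <- 1
e <- rnorm(n, sd = sigma)
y <- numeric(n)
y[1] <- beta0 / (1 - phi)  # start at the stationary mean
for (t in 2:n) y[t] <- beta0 + phi * y[t - 1] + e[t]

mean(y)  # ~ beta0 / (1 - phi)     = 2
var(y)   # ~ sigma^2 / (1 - phi^2) = 1.33
```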

AR Model

\[ \gamma_p = Cov(y_t, y_{t-p}) = \mathbb{E}(y_t* y_{t-p}) = \\ \phi_1^p * \frac{\sigma^2}{1 - \phi_1^2} \]

The autocovariance is constant, as it does not depend on \(t\).

Note: the autocorrelation is the autocovariance divided by the variance, which here is \(\phi_1^p\)
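So for an AR(1) the ACF should decay geometrically as \(\phi_1^p\); a simulated check (with \(\phi_1 = 0.7\), chosen arbitrarily):

```r
# acf of a simulated AR(1) vs. the theoretical phi^lag decay
set.seed(6)
phi <- 0.7
y <- arima.sim(model = list(ar = phi), n = 50000)
a <- acf(y, lag.max = 3, plot = FALSE)$acf  # a[1] is lag 0 (always 1)
cbind(estimated = a[2:4], theoretical = phi^(1:3))
```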

AR Model

Characteristic Polynomial: for a zero-mean AR(p) process

\[ y_t = \phi_1By_t + \phi_2B^2y_t + ... + \phi_pB^py_t + \epsilon_t \\ y_t -\phi_1By_t - \phi_2B^2y_t - ... - \phi_pB^py_t = \epsilon_t \\ y_t (1- \phi_1B - \phi_2B^2 - ... - \phi_pB^p) = \epsilon_t \]

That looks like a polynomial in those parentheses! We can find the roots of a polynomial.

AR Model

For an \(AR(1)\):

\[ y_t(1-\phi_1B) = \epsilon_t \]

we’ll replace \(B\) with a variable \(z\)

\[ 1-\phi_1 z = 0 \]

AR Model

the root of \((1-\phi_1 z)\) is:

\[ (1-\phi_1 z) = 0 \\ 1 = \phi_1 z \\ z = \frac{1}{\phi_1} \]

Thus, we can restate our stationarity condition as \(|root| = |\frac{1}{\phi_1}| > 1\)

This relationship holds even when we have higher order polynomials from an \(AR(p)\) model.
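`polyroot()` makes this check mechanical for any \(AR(p)\): pass the coefficients of \(1 - \phi_1 z - ... - \phi_p z^p\) and verify that every root's modulus exceeds 1 (the \(\phi\) values below are illustrative):

```r
# stationarity check via the characteristic polynomial of an AR(2)
phis  <- c(0.5, 0.3)
roots <- polyroot(c(1, -phis))  # coefficients of 1 - 0.5z - 0.3z^2

Mod(roots)           # moduli of the (possibly complex) roots
all(Mod(roots) > 1)  # TRUE => this AR(2) is stationary
```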

AR Model

AR Model in R

par(mfrow = c(1, 2))
plot(simulated_data, main = "AR(3)")
pacf(simulated_data, main = "Partial Autocorrelation Plot")

AR Model in R

ar3 <- arima(simulated_data, order = c(3,0,0))
print(ar3)

Call:
arima(x = simulated_data, order = c(3, 0, 0))

Coefficients:
         ar1     ar2     ar3  intercept
      0.4945  -0.687  0.6803    -0.0427
s.e.  0.0741   0.054  0.0732     0.1886

sigma^2 estimated as 0.9663:  log likelihood = -141.67,  aic = 293.33

AR Model in R

ts.plot(simulated_data)
AR_fit <- simulated_data - residuals(ar3)
points(AR_fit, type = "l", col = 2, lty = 2)

MA Model

An \(MA(q)\) model is a moving average model of a stationary time series with the form

\[ y_t = \beta_0 + \epsilon_t - \theta_1\epsilon_{t-1} - ...- \theta_q\epsilon_{t-q} \]

Thus, the current value is predicted based on the errors (white noise components) of previous time steps. You use your past errors to help improve your current prediction.
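A simulated MA(2) shows the mirror-image diagnostic to the AR case: the ACF (not the PACF) cuts off after lag \(q\). Note that R's `arima()`/`arima.sim()` use the opposite sign convention for the \(\theta\)s (they are added, not subtracted); the coefficients below are illustrative:

```r
# simulate an MA(2); its acf should be ~0 beyond lag 2
set.seed(7)
sim <- arima.sim(model = list(ma = c(0.6, 0.4)), n = 5000)
a <- acf(sim, lag.max = 5, plot = FALSE)$acf  # a[1] is lag 0
round(a[2:6], 3)  # lags 1-2 clearly nonzero, lags 3-5 near zero
```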

MA Model

par(mfrow = c(1, 2))
ts.plot(simulated_data)
acf(simulated_data)

MA Model in R

ma15 <- arima(simulated_data, order = c(0,0,15))
print(ma15)

Call:
arima(x = simulated_data, order = c(0, 0, 15))

Coefficients:
         ma1      ma2      ma3     ma4     ma5      ma6     ma7     ma8
      0.5762  -0.4246  -0.0070  0.7360  0.2235  -0.4281  0.1256  0.5307
s.e.  0.1004   0.1222   0.1505  0.1261  0.1668   0.1390  0.1640  0.1558
          ma9     ma10    ma11    ma12     ma13    ma14    ma15  intercept
      -0.1653  -0.3613  0.5099  0.2945  -0.2178  0.0306  0.2341    -0.0737
s.e.   0.1581   0.1691  0.1607  0.1651   0.1459  0.1655  0.1135     0.2368

sigma^2 estimated as 0.8511:  log likelihood = -137.93,  aic = 309.87
ts.plot(simulated_data)
MA_fit <- simulated_data - residuals(ma15)
points(MA_fit, type = "l", col = 2, lty = 2)

MA Model in R

par(mfrow = c(1, 2))
plot(simulated_data, main = "MA(2)")
acf(simulated_data, main = "Autocorrelation Plot")

MA Model in R

ma3 <- arima(simulated_data, order = c(0,0,3))
print(ma3)

Call:
arima(x = simulated_data, order = c(0, 0, 3))

Coefficients:
         ma1      ma2     ma3  intercept
      0.4198  -0.0222  0.0864     0.0617
s.e.  0.1010   0.1064  0.1127     0.1644

sigma^2 estimated as 1.236:  log likelihood = -152.62,  aic = 315.23

MA Model in R

ts.plot(simulated_data)
MA_fit <- simulated_data - residuals(ma3)
points(MA_fit, type = "l", col = 2, lty = 2)

ARMA Model

An \(ARMA(p,q)\) model has both an \(AR(p)\) and \(MA(q)\) component. An \(ARMA(1,1)\) would look like:

\[ y_t = \beta_0 + \underbrace{\phi_1y_{t-1}}_{AR} + \underbrace{\theta_1\epsilon_{t-1}}_{MA} + \epsilon_t \]

ARMA Model in R

arma13 <- arima(simulated_data, order = c(1,0,3))
print(arma13)

Call:
arima(x = simulated_data, order = c(1, 0, 3))

Coefficients:
         ar1     ma1      ma2     ma3  intercept
      0.2984  0.1228  -0.1441  0.1030     0.0649
s.e.  0.9962  0.9909   0.4388  0.1117     0.1710

sigma^2 estimated as 1.235:  log likelihood = -152.57,  aic = 317.15

ARMA Model in R

ARIMA

ARIMA models have three important components:

  • AR: autoregressive component

  • I: integrated (stationarity)

  • MA: moving average component

\(ARIMA(p,d,q)\) refers to a model with an Auto Regressive component of \(p\), an integrated component of \(d\), and a Moving Average component of \(q\)

ARIMA

\[ x_t = \beta_0 + \beta_1*t + \epsilon_t \]

💡 what if we used \(x_t\) to create a new time series, \(z_t\)?

\[ z_t = x_t - x_{t-1} = \\ \left[\beta_0 + \beta_1*t + \epsilon_t\right] - \left[\beta_0 + \beta_1*(t-1) + \epsilon_{t-1}\right] = \\ \beta_1 + (\epsilon_t - \epsilon_{t-1}) \]

ARIMA

But what if \(z_t\) is not stationary either?

\[ w_t = z_t - z_{t-1} \]

And just keep going \(d\) times until the time series is stationary.
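`diff()` takes a `differences` argument for exactly this; e.g. a quadratic trend needs \(d = 2\) (a simulated sketch with arbitrary values):

```r
# second differencing removes a quadratic trend
set.seed(8)
t <- 1:500
x <- 0.02 * t^2 + rnorm(500)

w <- diff(x, differences = 2)  # same as diff(diff(x))
mean(w)  # ~ 2 * 0.02 = 0.04: the constant second difference of 0.02*t^2
```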

ARIMA

ARIMA

ARIMA in R

arima113 <- arima(simulated_data, order = c(1,1,3))
print(arima113)

Call:
arima(x = simulated_data, order = c(1, 1, 3))

Coefficients:
         ar1     ma1      ma2      ma3
      0.3560  0.6065  -0.3365  -0.2023
s.e.  0.1741  0.1700   0.1613   0.0354

sigma^2 estimated as 0.9149:  log likelihood = -1375.18,  aic = 2760.36

ARIMA in R

Gaussian Processes for Time Series

Gaussian: a multivariate normal distribution

\[ X \sim \mathcal{N}(\mu, \Sigma) \]

where \(X = (x_1, x_2,..., x_p)\)

\(\Sigma\) tells you how the various components of \(X\) are related (aka how they covary).

Gaussian Processes for Time Series

let’s take \(X\) and split it into two groups of data, \(\begin{bmatrix}X\\Y \end{bmatrix}\)

\[ \begin{bmatrix}X\\Y \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma) \\ \mu = \begin{bmatrix}\mu_X\\\mu_Y \end{bmatrix}; \Sigma = \begin{bmatrix}\Sigma_{X,X} & \Sigma_{X,Y}\\\Sigma_{Y,X} & \Sigma_{Y,Y} \end{bmatrix} \]

  • marginalizing: integrating over \(X\) (aka summarizing over all values of \(X\)), what is the distribution of \(Y\) (or vice versa)

  • conditioning: \(X | Y\) given a value of \(Y\), what is the distribution of \(X\) (or vice versa)
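Conditioning is what makes GPs useful for prediction, and it has a closed form (the standard multivariate normal result, stated here for reference):

\[ X | Y = y \sim \mathcal{N}\left(\mu_X + \Sigma_{X,Y}\Sigma_{Y,Y}^{-1}(y - \mu_Y),\ \Sigma_{X,X} - \Sigma_{X,Y}\Sigma_{Y,Y}^{-1}\Sigma_{Y,X}\right) \]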


Gaussian Processes for Time Series

What if, like in our other time series models, the items in our vector \(\begin{bmatrix}X\\Y \end{bmatrix}\) are different time points?

\[ \begin{bmatrix}X\\Y \end{bmatrix} = (x_1,x_2,...x_n, y_1, y_2, ..., y_n) \]

Kernels

The kernel of a GP defines the covariance between any two time points. In other words, how much influence the points have on each other.

from: https://distill.pub/2019/visual-exploration-gaussian-processes/

Kernels

\[ \Sigma = Cov(x,x') = k(x,x') \]

For example:

\[ k(x,x') = \sigma^2 e^{- \frac{(x-x')^2}{2l}} \]

where \(l\) and \(\sigma\) are hyperparameters that you choose.


and I spend so much of my day choosing them…istg

Kernels

We could use \(k(x,x')\) to “fill out” the covariance matrix \(\Sigma\) which tells us how each data point influences (or covaries with) each other data point.

Kernel Simulation

Let’s say we want to look at 100 data points at \(t = 0, t = 1, ..., t = 99\). The kernel gives us information about how these 100 time points covary.

Using the squared exponential kernel we talked about before, we can generate a few different 100-point time series with:

  • \(\mu = 0\) (for convenience)
  • \(\Sigma =\sigma^2 e^{- \frac{(x-x')^2}{2l}}\)

Kernel Simulation

Knowing \(\mu\) and \(\Sigma\) we can sample from a \(\mathcal{N}(0,\Sigma)\) distribution.
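A base-R sketch of that sampling step, filling \(\Sigma\) with the squared exponential kernel from before (the hyperparameters \(\sigma^2 = 1\) and \(l = 25\) are chosen arbitrarily; a small jitter on the diagonal keeps the Cholesky factorization numerically stable):

```r
# build Sigma from k(x, x') and draw one 100-point series from N(0, Sigma)
set.seed(9)
t_grid <- 0:99
sigma2 <- 1; l <- 25

K <- outer(t_grid, t_grid,
           function(x, xp) sigma2 * exp(-(x - xp)^2 / (2 * l)))

L <- chol(K + diag(1e-6, 100))  # upper-triangular L with K = t(L) %*% L
draw <- t(L) %*% rnorm(100)     # one sample path from the GP prior
```

Each fresh call to `rnorm(100)` gives another smooth curve; a smaller \(l\) makes the draws wigglier.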

Kernel Simulation

Conditioning on Observed Data

Alone, the kernel acts as a prior that encodes our assumptions about the shape of the GP. Now, we incorporate the observed data by conditioning on it.

Conditioning on Observed Data

Seasonal GP Kernels

Gaussian Processes in R

library(kernlab)
# Generate synthetic time series data
set.seed(540)
time <- seq(0, 10, length.out = 100)
y <- sin(time) + rnorm(100, sd = 0.2) 

# Fit a Gaussian Process model using the radial basis function kernel
gp_model <- gausspr(time, y, kernel = "rbfdot", var = 0.1, variance.model = TRUE)

# Predict using the fitted Gaussian Process model
time_test <- seq(0, 14, length.out = 1000)
y_pred <- predict(gp_model, time_test)
y_up <- y_pred + 2*predict(gp_model, time_test, type="sdeviation")
y_low <- y_pred -2*predict(gp_model, time_test, type="sdeviation")

Gaussian Processes in R

Using automatic sigma estimation (sigest) for RBF or laplace kernel